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Abstract 

This paper considers the problem of reasoning on massive amounts of (possibly distributed) 
data. Presently, existing proposals show some limitations: (i) the quantity of data that can 
be handled contemporarily is limited, due to the fact that reasoning is generally carried out 
in main-memory; (ii) the interaction with external (and independent) DBMSs is not trivial 
and, in several cases, not allowed at all; (Hi) the efficiency of present implementations 
is still not sufficient for their utilization in complex reasoning tasks involving massive 
amounts of data. This paper provides a contribution in this setting; it presents a new 
system, called DLV DS , which aims to solve these problems. Moreover, the paper reports 
the results of a thorough experimental analysis we have carried out for comparing our 
system with several state-of-the-art systems (both logic and databases) on some classical 
deductive problems; the other tested systems arc: LDL++, XSB, Smodels and three top- 
level commercial DBMSs. D\N DB significantly outperforms even the commercial Database 
Systems on recursive queries. 

To appear in Theory and Practice of Logic Programming (TPLP). 

KEYWORDS: Deductive Database Systems, Answer Set Programming / Declarative Logic 
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1 Introduction 

The problem of handling massive amounts of data received much attention in the 
research related to the development of efficient Database Management Systems 
(DBMSs). In this scenario, a mounting wave of data intensive and knowledge based 
applications, such as Data Mining, Data Warehousing and Online Analytical Pro- 
cessing has created a strong demand for more powerful database languages and 
systems. An important effort in this direction has been carried out with the intro- 
duction of the latest standard for SQL, namely SQL99 (SQL 1999) which provides, 
among other features, support to object oriented databases and recursive queries. 

However, the adoption of SQL99 is still far from being a "standard"; in fact 
almost all current DBMSs do not fully support SQL99 and, in some cases, they 
adopt proprietary (non standard) language constructs and functions to implement 
parts of it. Moreover, the efficiency of current implementations of SQL99 constructs 
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and their expressiveness are still not sufficient for performing complex reasoning 
tasks on huge amounts of data. 

The needed expressiveness for reasoning tasks can be provided by logic-based sys- 
tems. In fact, declarative logic programming provides a powerful formalism capable 
of easily modelling and solving complex problems. The recent development of effi- 
cient logic-based systems like DLV (jLeone et al. 2004]) . Smodels (jNiemela et al. 2000)) . 
XSB (|Rao et al. 19971) . ASS AT (ILin and Zhao 20021 ILin and Zhao 20041) . Cmod- 
els dGiunchiglia et al. 2004| |Giunchiglia et al. 2006|), CLASP (jGebser et al. 2007|) . 



etc., has renewed the interest in the area of non-monotonic reasoning and declara- 
tive logic programming for solving real world problems in a number of application 
areas. However, "data intensive" problems can not be handled in a typical logic 
programming system working in main-memory. 

In the past, Deductive Database Systems (DDS) have been proposed to combine 
the expressive power of logic-based systems with the efficient data management of 
DBMSs (|Arni et al. 2003|IGallaire et al. 1984|ICeri et al. 1990|IGrant and Minker 1992[) : 
basically, they are an attempt to adapt typical Datalog systems, which have a 
"smalldata" view of the world, to a "largedata" view of the world via intelligent 
interactions with some DBMSs. Recently emerging application contexts such as the 
ones arising from the natural recursion across nodes in the Internet, or from the 
success of intrinsically recursive languages such as XML (Winslctt 2006), renewed 
the interest in such kinds of systems (jAbiteboul et al. 2005|ILoo et al. 2005[) . 

However, the main limitations of currently existing DDSs reside both in the fact 
that reasoning is still carried out in main-memory - this limits the amount of data 
that can be handled - and in the limited interoperability with generic, external, 
DBMSs they provide. In fact, generally, the reasoning capabilities of these systems 
are tailored on a specific (either commercial or ad- hoc) DBMS. 

Summarizing: (i) Database systems are nowadays robust and flexible enough to 
efficiently handle large amounts of data, possibly distributed; however, their query 
languages are not sufficiently expressive to support reasoning tasks on such data. 
(ii) Logic-based systems are endowed with highly expressive languages, allowing 
them to support complex reasoning tasks, but they work in main-memory and, 
hence, can only handle limited amounts of data. (Hi) Deductive database systems 
allow to access and manage data stored in DBMSs, however they perform their 
computations mainly in main-memory and provide limited interoperability with 
external (and possibly distributed) DBMSs. 

This work provides a contribution in this setting, bridging the gap between 
logic-based DDSs and DBMSs. It presents a new system, named DLV DS , which 
is logic-based (like a DDS) but can do all the work in mass-memory and, in prac- 
tice, does not have any limitation in the dimension of input data; moreover, it is 
capable to exploit optimization techniques both from DBMS (e.g., join orderings 
(Garcia-Molina et al. 2000)) and DDS theory (e.g., magic sets (Beeri and Ramakrisnhan 1991; 
IMumick et al. 1996ft ) . 

DLV DB allows for two typologies of execution: (i) direct database execution, 
which evaluates logic programs directly on database, with a very limited usage 
of main-memory but with some limitations on the expressiveness of the queries, 
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and (ii) main-memory execution, which loads input data from different (possibly 
distributed) databases and executes the logic program directly in main-memory. 
In both cases, interoperation with databases is provided by ODBC connections; 
these allow handling, in a quite simple way, data residing on various databases 
over the network. In order to avoid possible confusion, in the following we use the 
symbol DLV DB to indicate the whole system when the discussion is independent 
of the execution modality; however, when it is needed to distinguish between the 
two execution modalities, we use the symbol DLV /0 to indicate the main-memory 
execution, whereas the symbol DLV DB to indicate the direct database execution. 

Summarizing, the overall contributions of this work are the following: (i) The de- 
velopment of a fully fledged system enhancing in different ways the interactions be- 
tween logic-based systems and DBMSs. (ii) The development of an efficient, purely 
database-oriented, evaluation strategy for logic programs which minimizes the usage 
of main-memory and maximizes the advantages of optimization techniques imple- 
mented in existing DBMSs. (Hi) The definition of a framework for carrying out an 
experimental comparative analysis of the performance of state-of-the-art systems 
and DIN DB . (iv) The execution of a thorough experimentation which shows that 
DLV DB beats, often with orders of magnitude, Logic-Based Systems (LDL++, XSB, 
and Smodelsu) and even commercial DBMSs both for running times and amount of 
handled data on classical deductive problems (Bancilhon and Ramakrishnan 1988). 

The work is organized as follows. Section [2] presents the reasoning language sup- 
ported by the system, whereas Section [3] describes the functionalities it provides. 
In Section |4] the main implementation principles adopted in the development of 
DLV DB are discussed and Section [5] illustrates its general architecture. Section [6] 
first presents an overview of the state-of-the-art systems related to ~D\N DB , then it 
describes the experimental analysis we have carried out to compare DLV^ 5 with 
these systems on classical DDS problems. Finally, in Section [7] we draw our conclu- 
sions. 

2 The reasoning language of the system 

In this section we briefly describe the syntax and the semantics of the reasoning 
language adopted by the DLV DB system. This is basically Disjunctive Logic Pro- 
gramming (DLP) with Aggregate functions under the Answer Set Semantics; we 
refer to this language as DLP-^ in the following. The interested reader can find all 
details about DLP- 4 in (jFaber et al. 2004|) . 

Before starting the presentation, it is worth pointing out that the direct database 
execution modality supports only a strict subset of the reasoning language sup- 
ported by the main-memory execution. In particular, while DLV 7 ° supports the 
whole language of DLV (including disjunction, unlimited negation, and stratified 



It is worthwhile noting that, since benchmark programs are stratified, they are completely solved 
by the grounding layer of Smodels (LParse). This is the reason why we have not experimented 
also with ASSAT, Cmodels, and CLASP, as they also use LParse for grounding. 
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aggregates), DLV DB supports or- free programs with stratified negation and aggre- 
gates. 

2. 1 Syntax 

We assume that the reader is familiar with standard DLP; we refer to atoms, 
literals, rules, and programs of DLP, as standard atoms, standard literals, standard 
rules, and standard programs, respectively. For further background, see (jBaral 20021 
IGelfond and Lifschitz 1991]) . 

Set Terms. A (DLP -4 ) set term is either a symbolic set or a ground set. A symbolic 
set is a pair { Vars : Conj}, where Vars is a list of variables and Conj is a conjunction 
of standard atomsd A ground set is a set of pairs of the form (t: Conj), where t 
is a list of constants and Conj is a ground (variable free) conjunction of standard 
atoms. 

Aggregate Functions. An aggregate function is of the form f(S), where S is a set 
term, and / is an aggregate function symbol. Intuitively, an aggregate function can 
be thought of as a (possibly partial) function mapping multisets of constants to a 
constant. 

The aggregate functions which are currently supported are: #count (number 
of terms), #sum (sum of non- negative rational numbers), #min (minimum term, 
undefined for empty set), #max (maximum term, undefined for empty set), #avg 
(average of non-negative rational numbers jE 

Aggregate Literals. An aggregate atom is f(S) -< T, where f(S) is an aggregate 
function, -<£ {=, <, <,>,>} is a predefined comparison operator, and T is a 
term (variable or constant) referred to as guard. 

An example of aggregate atom is: #max{Z : r(Z),a(Z, V)} > Y. 

An atom is either a standard (DLP) atom or an aggregate atom. A literal L is 
an atom A or an atom A preceded by the default negation symbol not; if A is an 
aggregate atom, L is an aggregate literal. 
DLP- 4 Programs. A DLP' 4 rule r is a construct 

fliv ■•■ va„ :- bi,---,b k , not b k+1 ,---, not b m . 
where a\, . . . ,a n are standard atoms, b±, ■ ■ ■ , b m are atoms, n > 0, and m > k > 0. 
The disjunction a\ v • • • va n is referred to as the head of r whereas the conjunction 
&i,...,6fc, not bk+i, ...,not b m is the body of r. We denote the set {a\, . . . ,a n } of 
the head atoms by H(r), and the set {b\, ...,bk, not bk+i, ...,not b m } of the body 
literals by B(r). B + {r) and B~(r) denote, respectively, the set of positive literals 
and the set of negative literals occurring in B(r). 

A DLP- 4 program V is a set of DLP- 4 rules. 

Note that DLP- 4 allows also for built-in predicates (jFaber an d Pfcifc r~996j) in 
its rules, such as the comparative predicates equality, less-than, and greater-than 
(=, <, >) and arithmetic predicates like addition or multiplication (+, *). 

2 Intuitively, asymbolicset {X:a(X, Y), p(Y)} stands for the set of X-valucs making a(X, Y), p(Y) 
true, i.e., {X\3Ys.t. a(X,Y),p(Y) is true}. 

3 The first two aggregates correspond, respectively, to the cardinality and weight constraint literals 
of Smodels. 



Theory and Practice of Logic Programming 



5 



Safety. A global variable of a rule r is a variable appearing in a standard atom of 
r; all other variables are local variables. A rule r is safe if the following conditions 
hold: (i) each global variable of r appears in a positive standard literal in the body 
of r; (ii) each local variable of r appearing in a symbolic set { Vars : Conj} appears 
in an atom of Conj; (Hi) each guard of an aggregate atom of r is a constant or a 
global variable. A program V is safe if all r G V are safe. In the following we assume 
that DLP" 4 programs are safe. 

Let the level mapping of a program V be a function | | from the predicates in V to 
finite ordinals; moreover, given an atom A = p{t\, . . . ,t n ), we denote by Pred(A) 
its predicate p. 

Stratified! 101 programs. A DLP" 4 program V is called stratified! 101 ( |Apt et al. 1988 
|Przymusinski 1988] ), if there is a level mapping || || s of V such that, for every rule 
r: (1) for any I G B+(r), and for any I' G H(r), ||Pred(7)|| s < \\Pred(l')\\ s ; (2) 
for any I £ B~(r), and for any I' G H(r), ||Pre<2(7)|| s < ||Pred(7')|| s ; (3) for any 
1,1' eH(r),\\Pred(l)\\ s = \\Pred(l')\\ s . 

Stratified a " r programs. A DLP" 4 programV is called stratified a " r (jDell'Armi et al. 2003b]) 
if there is a level mapping || || a of V such that, for every rule r: (1) if I appears 
in the head of r, and V appears in an aggregate atom in the body of r, then 
\\Pred(l')\\ a < ||Pred(7)|| a ; and (2) if I appears in the head of r, and /' occurs in a 
standard atom in the body of r, then ||Pred(7')|| a < ||Pred(7)|| a . (3) If both I and 
I' appear in the head of r, then ||Pred(Z')|| = ||Pred(Z)|| . 

Example 2.1 

Consider the program consisting of a set of facts for predicates a and b, plus the 
following two rules: 

q(X) :-p(X), #count{F : a(Y, X),b(X)} < 2. p(X) :- q(X),b(X). 

The program is stratified aff9r , as the level mapping \a\ \ — \ \b\ \ — 1, | \p\ \ = | \q\ | = 2 
satisfies the required conditions. If we add the rule b(X) :-p(X), then no level- 
mapping exists and the program becomes not stratified 0£,9r . □ 

Intuitively, the property stratified aff£,r forbids recursion through aggregates. 
Supported languages. The direct database execution modality (DIN DB ) currently 
supports only DLP' 4 programs which are disjunction free, stratified™ ' , and strat- 
ified' 1091 '. Note that both built-in predicates and aggregates are supported. 

Conversely, the main-memory execution modality (DLV /0 ) supports all DLP -4 
programs that are stratified 0901 *. As a consequence, unrestricted negation, disjunc- 
tion and non recursive aggregates are supported. 

2.2 Answer Set Semantics 

Universe and Base. Given a DLP -4 program V, let U-p denote the set of constants 
appearing in V, and B-p be the set of standard atoms constructible from the (stan- 

—X 

dard) predicates of V with constants in U-p. Given a set X, let 2 denote the set 
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of all multisets over elements from X. Without loss of generality, we assume that 
aggregate functions map to Q (the set of rational numbers). 

Instantiation. A substitution is a mapping from a set of variables to Up. A sub- 
stitution from the set of global variables of a rule r (to Up) is a global substitu- 
tion for r; a substitution from the set of local variables of a symbolic set 5 (to 
U-p) is a local substitution for 5. Given a symbolic set without global variables 
5 = {Vars : Conj}, the instantiation of 5 is the following ground set of pairs 
inst(S): {{'y( Vars) : j(Conj)) 7 is a local substitution for 5}o 
A ground instance of a rule r is obtained in two steps: (1) a global substitution 
a for r is first applied over r; (2) every symbolic set 5 in er(r) is replaced by its 
instantiation inst(S). The instantiation Ground(V) of a program V is the set of all 
possible instances of the rules of V . 

Example 2.2 

Consider the following program V\ : 

?(l)vp(2,2). g(2)vp(2,l). t(X) :- q(X), #sum{Y : p(X, Y)} > 1. 

The instantiation Ground(V\) is the following: 

<j(l)vp(2,2). f(l) :-9(l),#suin{(l:p(l,l)>,<2:p(l,2))}>l. 
<?(2)vp(2,l). *(2):- 9 f(2),#Buin{<l:p(2,l)>,<2:p(2,2))}>l. □ 

Interpretations. An interpretation for a DLP" 4 program V is a set of standard 
ground atoms, that is I C Bp. A positive literal A is true w.r.t. I if A 6 7, is false 
otherwise. A negative literal not A is true w.r.t. 7, if A £ I, it is false otherwise. 

An interpretation also provides a meaning for aggregate literals. 

Let I be an interpretation. A standard ground conjunction is true (resp. false) 
w.r.t. 7 if all its literals are true. The meaning of a set, an aggregate function, and 
an aggregate atom under an interpretation, is a multiset, a value, and a truth- value, 
respectively. Let f(S) be a an aggregate function. The valuation I(S) of 5 w.r.t. I 
is the multiset of the first constant of the elements in 5 whose conjunction is true 
w.r.t. I. More precisely, let 7(5) denote the multiset \t\ \ {tx,...,t n : Conj) E SA 
Conj is true w.r.t. I]. The valuation I(f(S)) of an aggregate function f(S) w.r.t. 
7 is the result of the application of / on 7(5). If the multiset 7(5) is not in the 
domain of /, 7(/(5)) = _L (where _L is a fixed symbol not occurring in V). 

An instantiated aggregate atom A = f(S) -< k is true w.r.t. I if: (i) 7(/(5)) ^ ±, 
and, (ii) 7(/(5)) -< k holds; otherwise, A is false. An instantiated aggregate literal 
not A = not(/(5) -< k) is true w.r.t. I if: (i) 7(/(5)) / L, and, (ii) 7(/(5)) -< k 
does not hold; otherwise, A is false. 

Minimal Models. Given an interpretation 7, a rule r is satisfied w.r.t. I if some head 
atom is true w.r.t. 7 whenever all body literals are true w.r.t. 7. An interpretation 
M is a model of a DLP" 4 program V if all r £ Ground{V) are satisfied w.r.t. M. A 
model M for V is (subset) minimal if no model N for V exists such that N C M. 

4 Given a substitution a and a DLP-'* object Obj (rule, set, etc.), we denote by a(Obj) the object 
obtained by replacing each variable X in Obj by a(X). 
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Answer Sets. We now recall the generalization of the Gelfond-Lifschitz transforma- 
tion to programs with aggregates from (jFaber et al. 2 004). 

Definition 2.3 ( jFaber et al. 200$ ) 

Given a ground DLP -4 program V and a total interpretation I , let V 1 denote the 
transformed program obtained from V by deleting all rules in which a body literal 
is false w.r.t. I. I is an answer set of a program V if it is a minimal model of 
Ground(V) . 



Example 2.4 

Consider the following two programs: 
Pi : {p(a) :- #count{X : p(X)} > 0.} P 2 : {p(a) :- #count{X : p(X)} < 1.} 

Ground(Pi) — {p(a) :- #count{(a : p(a))} > 0.} and Ground(P 2 ) — {p(a) 
:-#count{(a : p(a))} < 1.}; consider also interpretations Jj = {p(a)} and I 2 = 0. 
Then, Ground(P 1 ) h = Ground(P 1 ), Ground(P 1 ) 1 ' 2 = 0, and Ground(P 2 ) 11 = 0, 
Ground(P2) 12 — GroundiP^) hold. 1 2 is the only answer set of Pi (because 1\ is 
not a minimal model of Ground(P\) 11 ), whereas P 2 admits no answer set (Pi is 
not a minimal model of Ground(P 2 ) 11 , and I 2 is not a model of Ground(P 2 ) = 
Ground(P 2 ) h ). □ 

Note that any answer set A of V is also a model of V because Ground(V) A C 
Ground(V), and rules in Ground(V) — Ground(V) A are satisfied w.r.t. A. 



3 System Functionalities 

As pointed out in the Introduction, the presented system allows for two typologies 
of execution: (i) direct database execution (DIN DB ), which is capable of handling 
massive amounts of data but with some limitations on the expressiveness of the 
query program (see Section [2]), and (ii) main- memory execution (DLV /0 ) which 
allows the user to take full advantage of the expressiveness of DLP* 4 and to import 
data residing on DBMSs, but with some limitations on the quantity of data to 
reason about, given by the amount of available main-memory. 

The system, along with a manual and some examples, is available for download at 
the address http: //www. mat .unical . it/terracina/dlvdb . In the following we provide 
a general description of the main functionalities provided by F>LV DB and DLV . 
The interested reader can find all details on the system's web site. 



3.1 Direct Database Execution 

Three main peculiarities characterize the F>LV DB system in this execution modality: 
(i) its ability to evaluate logic programs directly and completely on databases with 
a very limited usage of main-memory resources, (ii) its capability to map program 
predicates to (possibly complex and distributed) database views, and (Hi) the pos- 
sibility to easily specify which data is to be considered as input or as output for 
the program. This is the main contribution of our work. 
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Auxiliary-Directives : := Init-section [Table-def inition] + [Query-Section]? [Final-section]* 
Init-Section ::=USEDB DatabaseName :UserName : Password [System-Like]?. 
Table-definition ::= 

[USE TableName [( AttrName [, AttrName]* )]? [AS ( SQL-Statement )]? 

[FROM DatabaseName : UserName : Password] ? 

[MAPTO PredName [( SqlType [, SqlType] * )]? ]?. 

I 

CREATE TableName [( AttrName [, AttrName]* )]? 

[MAPTO PredName [( SqlType [, SqlType]* )]? ]? 

[KEEP_AFTER_EXECUTION] ? . ] 
Query-Section ::= QUERY TableName. 
Final-section ::= 

[DBOUTPUT DatabaseName : UserName : Password . 

I 

OUTPUT [APPEND I OVERWRITE]? PredName [AS AliasName]? 
[IN DatabaseName : UserName : Password . ] 
System-Like ::= LIKE [POSTGRES I ORACLE I DB2 I SQLSERVER I MYSQL] 

Fig. 1. Grammar of the auxiliary directives. 

Roughly speaking, in this execution modality the user has his data stored in 
(possibly distributed) database tables and wants to carry out some reasoning on 
them; however the amount of such data, or the number of facts that are generated 
during the reasoning, is such that the evaluation can not be carried out in main- 
memory. Then, the program must be evaluated directly in mass-memory. 

In order to properly carry out the evaluation, it is necessary to specify the map- 
pings between input and output data and program predicates, as well as to specify 
wether the temporary relations possibly needed for the mass-memory evaluation 
should be maintained or deleted at the end of the execution. These can be speci- 
fied by some auxiliary directives. The grammar in which these directives must be 
expressed is shown in Figure [TJ 

Intuitively, the user must specify the working database in which the system has to 
perform the evaluation. Moreover, he can specify a set of table definitions; note that 
each specified table is mapped into one of the program predicates. Facts can reside 
on separate databases or they can be obtained as views on different tables. Attribute 
type declaration is needed only for a correct management of built-in predicates. The 
USE and CREATE options can be exploited to specify input and output data as well 
as temporary relations needed for the mass memory instantiation. Finally, the user 
can choose to copy the entire output of the evaluation or parts thereof in different 
databases. 

Example 3.1 

Assume that a travel agency asks to derive all the destinations reachable by an 
airline company either by using its aircrafts or by exploiting code-share agree- 
ments. Suppose that the direct flights of each company are stored in a relation 
f light_rel(Id, FromX, ToY, Company) of the database dbAirports, whereas the 
code-share agreements between companies are stored in a relation codeshare_rel 
(Companyl, Company2, Flightld) of an external database dbCommercial; if a 
code-share agreement holds between the company cl and the company c2 for 
flightld, it means that the flight flightld is actually provided by an aircraft of 
cl but can be considered also carried out by c2. Finally, assume that, for security 
reasons, travel agencies are not allowed to directly access the databases dbAirports 
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USEDB dlvdb :myname :mypasswd. 

USE flight_rel (Id, FromX, ToY, Company) FROM dbAirports : airportUser : airportPasswd 
MAPTO flight (integer, varchar (255) , varchar (255) , varchar (255) ) . 

USE codeshare_rel (Companyl, Company2, Flightld) FROM dbCommercial : commUser : commPasswd 
MAPTO codeshare (varchar (255) , varchar(255) , integer). 
CREATE destinations.rel (FromX, ToY, Company) 

MAPTO destinations (varchar (255) , varchar(255) , varchar(255)) KEEP_AFTER_EXECUTION . 
OUTPUT destinations AS composedCompanyRoutes IN dbTravelAgency : agencyName : agencyPasswd. 

Fig. 2. Auxiliary directives for Example 13.21 

and dbCommercial, and, consequently, it is necessary to store the output result in 
a relation composedCompanyRoutes of a separate database dbTravelAgency sup- 
posed to support travel agencies. The DLP- 4 program that can derive all the con- 
nections is: 

(1) destinations(FromX,ToY,Comp) :- flight(Id, FromX, ToY, Comp). 

(2) destinations(FromX, ToY, Comp) :- flighted, FromX, ToY,C2), 

codeshare(C2, Comp, Id). 

(3) destinations(FromX, ToY, Comp) :- destinations(FromX,T2,Comp), 

destinations (T2, ToY, Comp). 

In order to exploit data residing in the above mentioned databases, we should 
map the predicate flight to the relation f light_rel of dbAirports and the pred- 
icate codeshare to the relation codesharejrel of dbCommercial. Finally, we have 
to map the predicate destinations to the relation composedCompanyRoutes of 
dbTravelAgency. 

Now suppose that, due to a huge size of input data, we need to evaluate the 
program in mass-memory (on a DBMS). In order to carry out this task, the auxiliary 
directives shown in Figure [2] should be used. They allow to specify the mappings 
between the program predicates and the database relations introduced previously. 

□ 

It is worth pointing out that if a predicate is not explicitly mapped into a table, 
but a relation with the same name and compatible attributes is present in the 
working database, the system automatically hypothesize a USE mapping for them. 
Analogously, if a predicate is not explicitly mapped and no corresponding table 
exists in the working database, a CREATE mapping is automatically hypothesized 
for it. This significantly simplifies the specification of the auxiliary directives; in 
fact, in the ideal case - when everything is in the working database and each 
input predicate has the corresponding input table with the same name - only the 
Init-Section and one of CREATE or OUTPUT options are actually needed to run a 
program and check its output. 

3.2 Main-Memory Execution 

The main-memory execution modality of the system allows input facts to be (pos- 
sibly complex) views on database tables and allows exporting (parts of) predicates 
to database relations. However, the program evaluation is carried out completely 
in main-memory; this allows the system to evaluate more complex logic programs 
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(see Section [2]) but at the price of a lower amount of data the system can handle, 
due to the limited amount of main-memory. 

The concept of importing and exporting data from external data sources into 
logic-based systems is not new (see, for example (jArni et al. 20031 ILu et al. 19961 
IRao et al. 1997|0 ; the contribution of this execution modality is mainly of techno- 
logical relevance and has the merit of providing Answer Set Programming with an 
easy way to access distributed data spread over the network. Another advancement 
w.r.t. existing proposals is its flexibility in the types of external source that can 
be accessed; in fact, most of the existing systems are tailored on custom DBMSs, 
whereas our system can be interfaced with any external source which provides an 
ODBC connection. 

Intuitively, DLV can be exploited when the user has to perform very complex 
reasoning tasks (in the NP class or higher) but the data is available in database 
relations, or the output must be permanently stored in a database for further elab- 
orations. 

In order to perform these tasks, two built-in commands are added in DLV 7 to 
the standard DLP -4 syntax, namely the ^import and the #export commands: 

^import (databasename, "username" , "password" , "query" ,predname, typeConv) . 
#export(databasename, "username" , "password" ,predname,tablename) . 

An ^import command retrieves data from a table "row by row" through the 
query specified by the user in SQL and creates one atom for each selected tuple. 
The name of each imported atom is set to predname, and is considered as a fact of 
the program. 

The ^export command generates a new tuple into tablename for each new truth 
value derived for predname by the program evaluation. 

An alternative form of the #export command is the following: 

^export (databasename, "username", "password", predname, tablename, 

"REPLACE where <condition>" ) 

which can be used to remove from tablename the tuples of predname for which the 
"REPLACE where" condition holds; it can be useful for deleting tuples correspond- 
ing to violated integrity constraints. 

It is worth pointing out that if a DLP- 4 program contains at least one ^export 
command, the system can compute only the hrst valid answer set; this limitation 
has been introduced mainly to avoid an exponential space complexity of the system. 
In fact, the number of answer sets can be exponential in the input. 

Example 3.2 

Consider again the scenario introduced in Example l3.1i and assume that the amount 
of input data allows the evaluation to be carried out in main-memory. The built-in 
commands that must be added to the DLP - ^ program of Example l3.1l to implement 
the necessary mappings are: 

#import(dbAirports, "airportUser" , "airportPasswd" , "SELECT * FROM fhght_rel" , 

flight, type : UJNT, Q.CONST, Q.CONST, Q.CONST). 
#import(dbCommercial, "commUser", "commPasswd" , "SELECT * FROM codeshare_rel" , 

codeshare, type : Q.CONST, Q.CONST, UJNT). 
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#export(dbTravelAgency, "agencyName" , "agency Passwd" , destinations, 

composedCompanyRoutes) . 

□ 

Note that the syntax of DLV 7 ° directives is simpler than that of DUV DB auxiliary 
directives. This is because DLV is intended to provide an easy mechanism to 
load data into the logic program and then store its results back to mass-memory, 
whereas DLV DB is oriented to more sophisticated applications handling distributed 
data and mass-memory-based reasoning and, consequently, it must provide a richer 
set of options in defining the mappings. 

4 Implementation principles 

The main innovation of our system resides in the evaluation of DLP- 4 programs 
directly on a database. The evaluation process basically consists of two steps: (i) 
the translation of DLP-^ rules in SQL statements, (ii) the definition of an efficient 
SQL query plan such that the computed answers are the same as the ones of the 
main-memory execution, but where the evaluation process is completely carried out 
in mass-memory. In the following, we first describe the general philosophy of our 
mass-memory evaluation strategy, then we present the algorithms used to obtain 
SQL statements from DLP- 4 rules. 

4-1 General characteristics of the evaluation strategy 

The evaluation of a program V starts from the analysis of its intensional compo- 
nent. In particular, V is first transformed into an equivalent program V 1 which 
can be evaluated more efficiently by the subsequent steps. Transformations car- 
ried out in this phase take into account various aspects of the input program; as 
an example, they aim to (i) reduce the arity of intermediate relations whenever 
possible, (ii) reduce the size of intermediate relations (jFaber et al. 1999a) . (Hi) 
push down constants in the queries by magic-sets rewritings (jBancilhon et al. 1986| 
IBeeri and Ramakrisnhan 1991| IMumick et al. 19961 IRoss 19 90). etc. All these op- 
timizations do not take into account the extensional component (the facts) of V\ 
some other optimizations are described in (jFaber et al. 19 99a). 

After this, the connected components and their topological order (i.e., the De- 
pendency Graph) of the resulting program are computed. Then, it is evaluated one 
component at a time, starting from the lowest ones in the topological order. 

The evaluation of each component follows the Semi-Naive method IjUllman 1989) 
with the enhancements showed in (Balbin and Ramamohanarao 1987; Zaniolo et al. 1997) 
to optimize the evaluation of rules with non-linear recursion. 

In particular, the Semi-Naive algorithm applied to a component Vc can be viewed 
as a two-phase algorithm: the first one deals with non-recursive rules, which can be 
completely evaluated in one single step; the second one deals with recursive rules 
which need an iterative fixpoint computation for their complete evaluation. At each 
iteration there are a number of predicates whose extensions have been already fully 
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determined (predicates not belonging to Vc which have been therefore previously 
evaluated), and a number of recursive predicates (i.e., belonging to Vc) for which 
a new set of truth values can be determined from the available ones. Then, in order 
to evaluate, e.g., the rule: 

(n) : Po(X,Y) :- Pl (X,Y),p 2 (Y,Z), q (X,Z). 

where p\ and P2 are mutually recursive with po and q is not recursive, the standard 
Semi-Naive method evaluates the following formula (expressed in relational algebra) 
at each iteration: 

AP fe = AP-f -1 x P*- 1 ex Q U (a) 
P*" 1 ixi AP*- 1 ex Q (b) 

Here, a capital letter is used to indicate the database relation corresponding to the 
(lower case) predicate; Pj° indicates the values stored in relation Pi up to step k 
and AP^ is the set of new values determined for Pj at step k (in the following, we 
call AP* the differential of Pj ) . 

However, the standard Semi-Naive approach is characterized by inefficiencies in 
evaluating non-linear recursive rules. In fact, if each Pj° 1 is expanded in its (dis- 
joint) components P*f~ 2 and AP^ -1 the formula (a) U (b) above becomes: 

AP^ 1 ixi P 2 fc ~ 2 x Q U (1) 
AP*" 1 tx AP*- 1 m Q U (2) 

pf- 2 x ap*- 1 >3 q u fs; 

AP/ 1 " 1 ixi AP^ 1 >a Q ft) 

where (a) expands P^ _1 and expands Pf _1 ; note that line (#,) and (^j are identi- 
cal. The enhancement described in (Balbin and Ramamohanarao 1987; Zaniolo ct al. 1997) 
provides a solution to this problem rewriting the original rule in: 

AP fe = APf" 1 ex P^ 1 ex Q U 

Pf" 2 ex AP^ 1 ex Q 

which, indeed, avoids to re-compute joins in (2), (4) more times. 

Generalizing the solution to a rule having r predicates mutually recursive with its 
head, the differentiation is obtained by subdividing the original rule in r sub-rules 
such that the i-th sub-rule has the form Apg : -p\~ 2 \ . . . ,f>£r x 2 , Ap^,p^, . . . , 
Pr^A- Note that both relations Pf~\ P/~ 2 and APj" -1 are considered. 

In our approach, we follow a slightly different strategy, which both unfolds each 
relation Pj 1 in Pj 2 and APj 0-1 and avoids to produce the redundant sub-rules 
of the standard Semi-Naive method. This is carried out as follows. Let us tag the 
differential relations (APj 1-1 ) with the symbol 1 and the standard ones (Pj~ 2 ) with 
the symbol 0. Given a generic rule with r predicates pi, . . . ,p r in its body mutually 
recursive with the head, our approach follows the binary enumeration between 1 
and 2 r — 1 and, for each of these binary numbers, it generates a differential rule; 
in particular, if position j on the binary number contains a 0, then P^~ 2 is put in 
the corresponding rule, otherwise APj^ 1 is used. As for the previous example, rule 
(n) is evaluated, in our approach, with joins (1), (2) and (3) shown above. 



AP fe 



(a) 
(b) 
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Differential Semi-Naive (Input : ivi , . . . , R;. Output: Qi , . . . , Q m , P^ , . . . , P n ) 
begin 

for i: = l to m do // Evaluate non recursive predicates 

(1) Q i =EVAL(q i , J?!,-.., P,, Qi, . . . , Q m ); 
for i: = l to n do begin // Initialize recursive predicates 

(2) 2 = EVAL (Pi. %>•■■. «(, Ql,- ■, Qm): 

(3) AP i_1 = P*" -2 ; 
end: 

repeat 



for i: = l to n do be_ 



(4) AP k = EVAL.DIFF( Pi , P fc 2 , . . . , P k 2 , AP fc 1 , . . . , AP^ 1 , Pi , . . . , P z ,Qi , . . . , Q m ); 



(5) A P. = A P . - P" - - AP" 



for i: = l to n do begin 

(6) p»=-2 _ pfc-2 u &p k 

(7) AP fc_1 = AP fc ; 
end ; 

until AP fc = ; Vi 1 < i < n; 
for i: = l to 



(8) Pi = P k 2 j 



Fig. 3. Algorithm Differential Semi-Naive 



Note that this approach generates a higher number of auxiliary rules w.r.t. 
(|Balbin an d Ramamohanarao 1987; Zani olo et al. 1997]) but, while avoiding to ex- 
ecute the same set of redundant joins, it allows handling smaller relations. This 
could constitute a good advantage when handling massive amounts of data, be- 
cause managing several small joins can be less resource demanding than executing 
few big ones. 

The algorithm implemented in our system for the differential semi-naive evalua- 
tion strategy described above is shown in Figure [31 It is executed for each compo- 
nent Vc of the input program V and assumes that input DLP- 4 rules have been 
already translated to SQL statements. Here, the component Vc depends on predi- 
cates r*i , . . . , r n solved in previous components and has qi , . . . , q m as non recursive 
predicates or facts and p%, . . . ,p n as recursive predicates. 

Function EVAL(qi, R\, . . . , Ri, Q%, . . . , Q m ) performs the evaluation of the non 
recursive rules having qi as head as follows: it first runs each SQL query correspond- 
ing to a rule having qi as head; then, the corresponding results are added to the 
relation Qi. 

Function EVAL-DIFFfa, P^ 2 , Pn~ 2 , AP*" 1 , . . . , AP* _1 ,_Ri, . . . , R h Q x , . . . , 
Qm) implements the optimization to the Semi-Naive method; it computes the new 
values for the predicate pi at the current iteration k starting from the values com- 
puted until iteration k — 2 and the new values obtained at the previous iteration 
k — 1. In more detail, the SQL statements corresponding to each recursive rule 
having pi as head are considered. The final result of EVAL_DIFF is stored in ta- 
ble AP/\ Clearly, it cannot be proved that EVAL_DIFF does not recompute some 
truth values already obtained in previous iterations. As a consequence, AP^ must 
be cleaned up from these values after the computation of EVAL_DIFF; this is ex- 
actly what is done by instruction (5) of the algorithm. Instruction (6) and (7) are 
needed to reuse the same relations (AP/% Af 1 ^ 1 , P^ ) at each iteration. 

Finally, it is worth pointing out that the last for of the algorithm (instruction 
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(8)) is shown just for clarity of exposition; in fact, in the actual implementation, 
what we indicated as P^~ 2 is exactly table Pi. 

It is worth pointing out that the basic step of the evaluation is the execution 
of standard SQL queries over the underlying data. In fact, one of the main ob- 
jectives in the implementation of DIN DB has been that of associating one single 
(non recursive) SQL statement with each rule of the program (either recursive or 
not), without the support of main-memory data structures for the evaluation. This 
allows DLV DB to minimize the "out of memory" problems caused by limited main- 
memory dimensions. Moreover, the overall organization of the evaluation strategy 
allows benefiting from both the optimizations on the intensional component of 
the program (the program rewriting techniques outlined at the beginning of this 
section) and the optimizations on the extensional component (the data) already 
implemented in the DBMS configured as the working database. 

The combination of such optimizations, along with a wise translation of datalog 
rules in efficient SQL queries allow DIN DB to boost the evaluation process even 
w.r.t. main-memory evaluation strategies (see Section [6]). 

4.2 From DLP A to SQL 

In this section we describe the general functions exploited to translate DLP -4 rules 
in SQL statements. Functions are presented in pseudocode and, for the sake of 
presentation clarity, they omit some details; moreover, since there is a one-to-one 
correspondence between the predicates in the logic program and the relations in the 
database, in the following, when this is not confusing, we use the terms predicate and 
relation interchangeably. It is worth recalling that these one-to-one correspondences 
are determined both from the user specifications in the auxiliary directives and from 
the mappings automatically derived by the system. 

In order to provide examples for the presented functions, we exploit the following 
reference schema: 

employ ee{Ename, Salary, Dep, Boss) department ( Code , Director) 

storing information about the employees of the departments of a given company. 
Specifically, each employee has associated a Boss who is, in his turn, an employee. 

Translating Non-recursive Rules. 

Non recursive rules are translated in a quite standard way in SQL. The only 
exceptions are made for rules containing aggregate functions and rules containing 
built-ins. The general format of the SQL statement generated in the translation is: 

INSERT INTO head(r) (Translate_SQL(r)) 

where head(r) returns the relation associated with the head of r; this task is car- 
ried out by considering the mappings specified in the auxiliary directives. Trans- 
late_SQL(r) takes into account the kind of rule (e.g., if it contains negation or 
built-ins, etc.) and calls the suitable transformation function. These functions are 
described next. 
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Translating Positive Rules. 

Intuitively, the SQL statement for positive rules is composed as follows: the SE- 
LECT part is determined by the variable bindings between the head and the body 
of the rule. The FROM part of the statement is determined by the predicates com- 
posing the body of the rule; variable bindings between body atoms and constants 
determine the WHERE conditions of the statement. Finally, an EXCEPT part is 
added in order to eliminate tuple duplications. The behaviour of function Trans- 
latePositiveRule is well described by the following example. 

Example 4.1 

Consider the following rule: 

qo(Ename) :- employ •ee(Ename, 100.000 , Dep, Boss), department (Dep, rossi). 

which returns all the employees working at the department whose chief is rossi and 
having a yearly salary of 100.000 euros. The corresponding SQL statement is the 
following^: 

INSERT INTO go ( 
SELECT employee. atti FROM employee, department 
WHERE employee. attz = department. atti AND department. aii2='rossi' 

AND employee.aii 2 =100.000 EXCEPT (SELECT * FROM q )) □ 

Translating rules with negated atoms. 

Intuitively, the construction of the SQL statement for this kind of rule is carried 
out as follows: the positive part of the rule is handled in a way very similar to 
what has been shown for function TranslatePositiveRule; then, each negated atom 
is handled by a corresponding NOT IN part in the statement. The behaviour of 
function TranslateRuleWithNegation is well illustrated by the following example. 

Example 4.2 

The following program computes (using the goal topEmployee) the employees which 
have no other boss than the director. 

topEmployee(Ename) :- employ ee(Ename, Salary , Dep, Boss), 

department (Dep, Boss), 
not other Boss(Ename, Boss). 

otherBoss(Ename, Boss) :- employ ee(Ename, Salary , Dep, Boss), 

employ ee(B oss , Salary, Dep, Bossl). 

The first rule above is translated to the following SQL statement: 

INSERT INTO topEmployee ( 
SELECT employee. atti FROM employee, department 

WHERE (employee. att3=department.atti) AND (employee. att4=department.att2) 
AND (employee. atti, employee. att,j) 

NOT IN (SELECT otherBoss.atti, otherBoss.att 2 FROM otherBoss ) 
EXCEPT (SELECT * FROM topEmployee)) □ 

5 Here and in the following we use the notation t.att; to indicate the i-th attribute of the table 
t. Actual attribute names are determined from the auxiliary directives. 
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Translating rules with built-in predicates. 

As pointed out in Section [2l in addition to user-defined predicates some com- 
parative and arithmetic predicates are provided by the reasoning language. When 
running a program containing built-in predicates, the range of admissible values for 
the corresponding variables must be fixed. We map this necessity in the working 
database by adding a restriction based on the maximum value allowed for variables. 
Moreover, in order to allow mathematical operations among attributes, DIN DB re- 
quires the types of attributes to be properly defined in the auxiliary directives. 

The function for translating rules containing built-in predicates is a slight varia- 
tion of the function for translating positive rules. As a matter of fact, the presence 
of a built-in predicate in the rule implies just adding a corresponding condition 
in the WHERE part of the statement. However, if the variables specified in the 
built-in are not bound to any other variable of the atoms in the body, a $:maxint 
value must be exploited to bound that variable to its admissible range of values. 

Example 4-3 
The program: 

qi(Ename) : — employ ee(Ename, Salary , Dep, Boss), Salary > 100.000. 

is translated to the SQL statement: 
INSERT INTO q x 

(SELECT employee.atti FROM employee WHERE employee.att 2 > 100.000 
EXCEPT (SELECT * FROM gi )) □ 

Translating rules with aggregate atoms. 

In Section[5]we introduced the syntax and the semantics of DLP with aggregates. 
We have also shown that specific safety conditions must hold for each rule containing 
aggregate atoms, in order to guarantee the computability of the corresponding rule. 
As an example, aggregate atoms can not contain predicates mutually recursive with 
the head of the rule they are placed in; from our point of view, this implies that 
the truth values of each aggregate function can be computed once and for all before 
evaluating the corresponding rule (which can be, in its turn, recursive). 

Actually, the process that rewrites input programs before their execution, auto- 
matically rewrites each rule containing some aggregate atom in such a way that it 
follows a standard format (we call this process standardization in the following). 
Specifically, given a generic rule of the form: 

head :- body, f({Vars : Conj}) -< Rg. 

where Conj is a generic conjunction and Rg is a guard, the system automatically 
translates this rule to a pair of rules of the form 

auxAtom :- Conj, Binding Atoms. 

head :- body, f({Vars : auxAtom}) -< Rg. 

where auxAtom is a standard rule containing both Conj and the atoms (Binding Atoms) 
necessary for the bindings of Conj with body and/or head. Note that auxAtom con- 
tains only those attributes of Conj that are strictly necessary for the computation 
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Function TranslateAggregateRule(VAR r: DLP" 4 rule): SQL statement 
begin 

for each a in aggr_atom(r) do begin 
cmir:=aux_atom(a); 

SQL:= "CREATE VIEW " + aux +".supp" + 
"AS (SELECT "+ bound-attr(a) + ", " + 

aggr_func(a) + "(" + aggr.attr(a) + ") " + 
"FROM " + aux + "GROUP BY " + bouncLattr(a) + ")"; 
removeFromBody(r, a); 
addToBody(r, aux_atom_supp(a)); 
addToBody(r, guards(a)); 
end; 

return SQL; 
end. 



Fig. 4. Function TranslateAggregateRule 

of / and, consequently, it may have far less (and can not have more) attributes 
than those present in Conj. 

In our approach we rely on this standardization to translate this kind of rule to 
SQL; clearly only the second rule, containing the aggregate function, is handled by 
the function we are presenting next; in fact, the first rule is automatically translated 
by one of the already presented functions. 

Intuitively, the objective of our translation is to create an SQL view aux Atom _supp 
from aux Atom which contains all the attributes necessary to bind aux Atom with 
the other atoms of the original rule and a column storing the results of the com- 
putation of / over auxAtom; the original aggregate atom is then replaced by this 
view and guard conditions are suitably translated by logic conditions between vari- 
ables. At this point, the resulting rule is a standard rule not containing aggregate 
functions and can be then translated by one of the functions we have presented 
previously; clearly enough, in this process, the original input rule r must be mod- 
ified to have a proper translation of its "standard" part. The function is shown in 
Figure [U it receives a rule r with aggregates as input and returns both the SQL 
views for the aggregate functions in r and the modified (standard) r, which will be 
handled by standard translation functional! 

Here function aggr_atom(r ) returns the aggregate atoms present in r; aux-atom(a) 
returns the auxiliary atom corresponding to Conj of a and automatically generated 
by the standardization. Function bound_attr(a) yields in output the attributes of the 
atom a bound with attributes of the other atoms in the rule, whereas aggr_attr(a) 
returns the attribute which the aggregation must be carried out onto (the first vari- 
able in Vars). aggr_func(a) returns the SQL aggregation statement corresponding 
to the aggregate function of a. Functions removeFromBody and addToBody are re- 
sponsible of altering the original rule r to make it standard (without aggregates). 
In particular, removeFromBody (r , a) removes the aggregate atom a from the rule 

6 Here and in the following we use the operator + to denote the "append" operator between 
strings. 
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r, whereas addToBody adds both aux-atomsupp(a) and guards(a) to r. Note that 
aux-atomsupp(a) yields in output the name of the atom corresponding to the just 
created auxiliary view, whereas guards (a) converts the guard of the aggregate atom 
a in a logic statement between attributes in the rule. 

Example 4-4 

Consider the following rule computing the departments which spend for the salaries 
of their employees, an amount greater than a certain threshold, say 100000: 

costlyDep(Dep) :- department(Dep, _ ), 

#sum{Salary, Ename : employ ee(Ename, Salary , Dep, _)} > 1000000 

The standardization automatically rewrites this rule as: 

aux^emp (Salary, Ename, Dep) :- department (Dep, _ ), 

employ ee(Ename , Salary, Dep, _ ). 
costly Dep (Dep) :- department(Dep, _ ), 

#sum{Salary, Ename : aux_emp (Salary, Ename, Dep)} > 100000. 

The first rule is treated as a standard positive rule and is translated to: 

INSERT INTO aux.emp ( 
SELECT employee. att2, employee. atti, department. atti 
FROM department, employee WHERE department. atti = employee. att3 
EXCEPT (SELECT * FROM aux.emp)) 

The second rule is translated to: 

CREATE VIEW aux.emp jupp AS ( 
SELECT aux.emp. att 3 , SUM (aux.emp. atti) FROM aux.emp 
GROUP BY aux_emp.att 3 ) 

INSERT INTO costlyDep ( 
SELECT department. atti FROM department, aux_emp_supp 

WHERE department. atti = aux_emp.supp.atti AND aux.emp_supp.att2 > 100000 
EXCEPT (SELECT * FROM costlyDep)) □ 

Translating recursive rules. 

As previously pointed out, our program evaluation strategy exploits a refined 
version of the Semi-Naive method. This is based on the translation of a recursive 
rule into a non recursive SQL statement operating alternatively on standard and 
differential versions of the relations associated with recursive predicates. Each time 
this statement is executed by the algorithm, it must compute just the new values 
for the predicate in the head that can be obtained from the values computed in the 
last two iterations of the fixpoint. 

Intuitively, the translation algorithm must first select the proper combinations 
of standard and differential relations from the rule r under consideration; then, for 
each of these combinations, it must rewrite r in a corresponding rule r' . Each r' 
thus obtained is non recursive and, consequently, it can be handled by Function 

7 Note that Ename is needed to sum also the salaries of employees earning the same amount (see 
the discussion on sets/multisets in {pcH'Armi ct al. 2003a]}). 
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Function TranslateRecursiveRule(r: DLP^ rule ): SQL statement 
begin 

5QL:=""; 

if (hasAggregate (r ) ) 

SQL:=TranslateAggregateRule(r); 

^. = 2RecursivePredicates(r)_^ 

SQL:=SQL+" INSERT INTO " + Ahead(r) + "("; 
for i:=l to n do begin 

Let r' be a rule; 

setHead(r', Ahead(r)); 

for each non recursive predicate qj in body(r) do 

addToBody(r', qj); 
for each recursive predicate pj in body(r) do 

if (bit(j,i)=0) then addToBody(r', p^ -2 ); 

else addToBody(r', Ap*^ 1 ): 
if (i 5* 1) SQL:=SQL+"UNION "; 
SQL:=SQL + TranslateNonRecursiveRule (/); 
end; 

SQL:=SQL + ")"; 
return SQL; 
end. 



Fig. 5. Function TranslateRecursiveRule 



TranslateNonRecursiveRule. Algorithm TranslateRecursiveRule is shown in Figure 

El 

Here, functions TranslateAggregateRule and TranslateNonRecursiveRule have been 
introduced previously. Function hasAggregate(r) returns true if r contains aggre- 
gate functions. Function RecursivePredicates(r) returns the number of occurrences 
of recursive predicates in the body of r; Aheadfr) returns the differential version of 
the relation corresponding to the head of r. Function setHead(r' , p) sets the head 
of the rule r' to the predicate p; analogously, function addToBody(r' , p) adds to 
the body of r' a conjunction with the predicate p. Function bit(j,i) returns the j-th 
bit of the binary representation of i. 

It is worth noticing that the execution of the queries resulting from function 
TranslateRecursiveRule implement function EVAL_DIFF for r (see the algorithm 
of Figure [3]) . 

Example 4-5 

Consider the situation in which we need to know whether the employee e\ is the 
boss of the employee e n either directly or by means of a number of employees 
e2, ..,e n such that e\ is the boss of e%, e-i is the boss of etc. Then, we have to 
evaluate the program: 

n : 92(^1,^2) :~ employ ee (E 1 , Salary , Dep, E 2). 
r 2 : q 2 (Ei,E 3 ) :- q 2 (Ei,E 2 ), q2(E 2 , E 3 ). 

containing the recursive rule r%. This program cannot be evaluated in one single 
iteration of the Semi-Naive computation. Rule n is not recursive; it is translated 
by Function TranslatePositiveRule to the following SQL which is evaluated once: 
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INSERT INTO q 2 ( SELECT employee.atti, employee.atU FROM employee 
EXCEPT (SELECT * FROM q 2 )) 

Rule r 2 is first translated by Function TranslateRecursiveRule to the temporary 
set of rules: 

r 2 : Aql(E 1 ,E 3 ) :- q^ 2 {E u E 2 ) 1 Aqt 1 (E 2 ,E 3 ). 
AgJ^i.^a) :- Aq 2 i -\E 1 ,E 2 ), qt 2 (E 2 , E 3 ). 
Aq^Es) :- Aqt\E 1 ,E 2 ), Aq^ 1 (E 2: E 3 ). 

which is translated to: 

INSERT INTO Aq% ( 
SELECT q*- 2 .atti, Aq%- 1 .att2 FROM q\~ 2 ,A<?2 _1 WHERE (q^ 2 .att^Aq^ 1 .atti) 
EXCEPT (SELECT * FROM Ag£) 
UNION 

SELECT Aq^.atti, q\~ 2 .att2 FROM Ag^T 1 , q^ 2 WHERE {Aq^ 1 .att 2 =q*~ 2 '.aUi) 

EXCEPT (SELECT * FROM A?f) 

UNION 

SELECT Aq^.attx, Aq^~ 1 A.att2 FROM Aq%~ 1 , Aq^' 1 AS Ag* -1 .! 
WHERE (A^- 1 .att 2 =Ag 2 fc - 1 _l.atti) 
EXCEPT (SELECT * FROM Agf)) 

Actually, the real implementation of this function adds, for performance reasons, 
also the following parts to the statement above: 

EXCEPT (SELECT * FROM Ag*" 1 ) 
EXCEPT (SELECT * FROM q^ 2 ) 

Note that, following Algorithm Differential Semi-Naive (Figure El), q 2 ~ 2 and Aq^' 1 
are first initialized with the result of the evaluation of r% (stored in q 2 - see instruc- 
tions (2) and (3) in Figure[3J). Then, the SQL above is iteratively executed until the 
fixpoint is reached. Note that, the aforementioned process executes instructions (1)- 
(5) of the algorithm in Figure[3] The update of q 2 2 and Aq 2 ~ x from one iteration 
to the subsequent one is carried out by instructions (6) and (7) in a straightforward 
way. □ 



4-2.1 A complete example 

Example 4- 6 

Consider the datalog program presented in Example 13.11 and the mappings shown 
in Figure [2l The complete query plan derived by DIN DB for them is: 

(1) INSERT INTO destinations_rel 

(SELECT f.FromX, f.ToY, f.Company FROM flight_rel AS f) 

(2) INSERT INTO destinations_rel 

(SELECT f.FromX, f.ToY, c.Company2 FROM flightjel AS f, codeshare_rel AS c 
WHERE (f.Id=c.FlightId) AND (f.Company=c.Companyl) 
EXCEPT (SELECT * FROM destinations_rel)) 



Theory and Practice of Logic Programming 



21 



(3) INSERT INTO d.destinations_rel 

(SELECT dl.FromX, d2.ToY, dl. Company 

FROM dl_destinations_rel AS dl, destinations_rel AS d2 

WHERE (dl.ToY=d2.FromX) AND (dl.Company=d2.Company) 

UNION 

SELECT dl.FromX, d2.ToY, dl. Company 

FROM destinations_rel AS dl, dl_destinations_rel AS d2 

WHERE (dl.ToY=d2.FromX) AND (dl.Company=d2.Company) 

UNION 

SELECT dl.FromX, d2.ToY, dl.Company 

FROM dl_destinations_rel AS dl, dl_destinations_rel AS d2 

WHERE (dl.ToY=d2.FromX)AND (dl.Company=d2.Company) 

EXCEPT (SELECT * FROM dl.destinations_rel) 

EXCEPT (SELECT * FROM destinations_rel) 

EXCEPT (SELECT * FROM d_destinations_rel)) 

SQL statements (1) and (2) are executed only once, since they correspond to 
non recursive rules. On the contrary, the statement (3) is executed several times, 
until the least fixpoint is reached, i.e. d_destinationsjrel is empty. Note that 
d_destinationsjrel and dl_destinationsjrel correspond, respectively, to Ahead(r) 
and Ap k ~ 1 introduced in function TranslateRecursiveRule; as shown in Section [4. II 
the evaluation algorithm suitably updates the tuples of destinationsjrel from 
the new values derived at each iteration in d_destinationsjrel. □ 

5 System Architecture 

In this section we present the general architecture of our system. It has been de- 
signed as an extension of the DLV system (|Leone et al. 2004)) . which allows both 
the evaluation of logic programs directly on databases and the handling of input 
and output data distributed on several databases. It combines the expressive power 
of DLV (and the optimization strategies implemented in it) with the efficient data 
management features of DBMSs (jGarcia-Molina et al. 2000|) . 

As previously pointed out, the system provides two, quite distinct, functioning 
modalities, namely the direct database execution and the main-memory execution 
modality. In the following we present the two corresponding architectures sepa- 
rately. 

5.1 Architecture of the direct database execution (DLV DB ) 

Figure [6] illustrates the architecture of the system for the direct database execution. 
In the figure, the boxes marked with DLV are the ones already developed in the 
DLV system. An input program V is first analyzed by the Parser which encodes the 
rules in the intensional database (IDB) in a suitable way and builds an extensional 
database (EDB) in main-memory data structures from the facts specified directly 
in the program (if any). As for facts already stored in database relations, no EDB is 
produced in main-memory. After this, the Optimizer applies a rewriting procedure 
in order to get a program V', equivalent to V, that can be evaluated more efficiently; 
some of the operations carried out by this module have been highlighted in Section 
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14.11 The Dependency Graph Builder computes the dependency graph of V' , its 
connected components and a topological ordering of these components. Finally the 
DB Instantiator module, the core of the system, is activated. 

The DB Instantiator module receives: (i) the IDB and the EDB (if not empty) 
generated by the parser, (ii) the Dependency Graph (DG) generated by the depen- 
dency graph builder and (Hi) the auxiliary directives specifying the needed interac- 
tions between DLV DB and the databases. It evaluates the input program through 
the bottom- up fixpoint evaluation strategy shown in Section Q] Since the input pro- 
gram is supposed to be normal and stratified (see Section [2|) , the DB Instantiator 
evaluates completely the program and no further modules must be employed after 
it. 

All the instantiation steps are performed directly on the working database through 
the execution of SQL statements and no data is loaded in main-memory from the 
databases in any phase of the process. This allows DLV DB to be completely inde- 
pendent of the dimension of both the input data and the number of facts generated 
during the evaluation. 

Communication with databases is performed via ODBC. This allows DLV^ 5 both 
to be independent from a particular DBMS and to handle databases distributed 
over the Internet. 

It is important to point out that the architecture of DLV DB has been designed in 
such a way to fully exploit optimizations both from logic theory and from database 
theory. In fact, the actually evaluated program is the one resulting from the Op- 
timizer module which applies program rewriting techniques aiming to simplify the 
evaluation process and to reduce the dimension of the involved relations (see Sec- 
tion |4J]). Then, the execution of the SQL statements in the query plan exploit 
data-oriented optimizations implemented in the DBMS. As far as this latter point 
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is concerned, we have experienced that the kind of DBMS handling the working 
database for F>LV DB may significantly affect system performance; in fact, when 
F)LV DB was coupled with highly sophisticated DBMSs it generally showed better 
performance in handling large amounts of data w.r.t. the same executions when 
coupled with less sophisticated DBMSs. 

The observation above points out both the importance of data-oriented optimiza- 
tions and a potential advantage of DIN DB w.r.t. deductive systems operating on 
ad-hoc DBMSs. In fact, DLV DB can be easily coupled with the most efficient DBMS 
available at the time being used (provided that it supports standard SQL), whereas 
the improvement of an ad-hoc DBMS is a more difficult task. 

5.2 Architecture of the main-memory execution (DLV 10 ) 

The architecture of DLV /0 is illustrated in Figure [7] It extends the classical DLV 
architecture with ODBC functionalities to import/export data from/to database 
relations. The main-memory execution modality acts just as an interface (based on 
ODBC connections) between the external databases and the standard DLV program. 

In more detail, input data can be supplied both by regular files and by relational 
tables accessed via ODBC as specified by the #import commands. Specifically, for 
each ^import command the system retrieves data from the corresponding table 
"row by row" through the SQL query specified by the user and creates one atom in 
main-memory (in the format required by DLV) for each selected tuple. The name of 
each imported atom is set to predname, and is considered as a fact. Possible facts 
residing in text files are fed into DLV in the standard way. All the data is fetched 
in main-memory before any evaluation task is carried out. 

The DLV kernel (the shaded part in the figure) then produces answer sets one 
at a time. It consists of three major components: the "Intelligent Grounding'!!, 
the "Model Generator" , and the "Model Checker" modules; these share a main 
data structure, the "Ground Program" . It is created by the Intelligent Grounding- 
using differential (and other advanced) database techniques together with suitable 
main- memory data structures, and used by the Model Generator and the Model 
Checker. The Ground Program is guaranteed to have exactly the same answer 
sets as the original program. For some syntactically restricted classes of programs 



8 It incorporates the Parser, the Optimizer and the DG Builder depicted in Figure [6] 



24 



G. Terracina, N. Leone, V. Lio, C. Panetta 



(e.g. stratified programs), the Intelligent Grounding module already computes the 
corresponding answer sets. 

For harder problems, most of the computation is performed by the Model Genera- 
tor and the Model Checker. Roughly, the former produces some "candidate" answer 
sets (models) (jFaber et al. 1999b[ IFaber et al. 2001)) . the stability and minimality 
of which are subsequently verified by the latter. 

The Model Checker verifies whether the model at hand is an answer set. This task 
is very hard in general, because checking the stability of a model is known to be co- 
NP-complete. However, this module exploits the fact that minimal model checking 
- the hardest part — can be efficiently performed for the relevant class of head- 



cycle-free (HCF) programs (Ben-Eliyahu and Dechter 1994 Ben-Eliyahu and Dechter 1996). 

Each time an answer set M is found, "Filtering" is invoked, which performs 
some post-processing, controls continuation or abortion of the computation, and 
possibly stores the output data in the corresponding relational tables as specified 
by the #export commands. In particular, if an ^export command from predname 
to tablename is present, the module generates a new tuple in tablename for each 
atom in M having name prednam^. 



6 Experiments and Benchmarks 

In this section we present our experimental framework and the results obtained com- 
paring the DLV DB system with several state-of-the-art systems. Benchmarks have 
been designed following the guidelines, problems and data structures proposed in 
(jBancilhon and Ramakrishnan 1988 ) and (jGreco 2003[) to assess the performance of 
deductive database systems. Roughly speaking, problems used in (jBancilhon and Ramakrishnan 1988")) 
and (jGreco 2003[) basically resort to the execution of some recursive queries on a 
variety of data structures. The main goal of our experiments was to evaluate the 
deductive capabilities of tested systems for both query answering time and amount 
of manageable data, especially with respect to the direct database execution of our 
system. 

All tests have been carried out on a Pentium 4 machine with a 1.4 GHz CPU 
and 512 Mbytes of RAM. 



6.1 Overview of Compared Systems 

In order to provide a comparative and comprehensive analysis with the state-of- 
the-art systems in the considered research area, we have compared our system 
performance, under both execution modalities (i.e, both T)\N DB and DLV /0 ), with: 
(i) LDL++, because it is one of the most robust implementations of deductive 
database systems; (ii) XSB, as an efficient implementation of the Top-Down evalu- 
ation strategy; (Hi) Smodels, one of the most widely used Answer Set Programming 
systems together with DLV; (iv) three commercial DBMSs supporting the execution 

9 As previously pointed out, the presence of an ^export command automatically limits the system 
to generate the first answer set only. 
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of recursive queries. Note that the licence of use of such DBMSs does not allow us 
to explicitly mention them in the paper; as a consequence, in the following, we call 
them simply DB-A, DB-B and DB-C. The reader should just know they are the 
three top-level commercial database systems currently available, which also support 
recursive queries. 

Note that important DBMSs, such as Postgres and MySQL could not be tested; 
in fact, they do not support recursive queries, which are the basis for our testing 
framework. Moreover, as we pointed out in the Introduction, other logic-based 
systems such as ASSAT, Cmodels, and CLASP have not been tested since they 
use the same grounding layer of Smodels (LParse) and, as it will be clear in the 
following, the benchmark programs are completely solved by this layer. 

In the following we briefly overview the main characteristics of the tested systems, 
focusing on their support to the language and technological capabilities addressed 
in this work. Specifically, we consider, for each database system, its capability to 
express recursive queries and, for each logic-based system, the expressiveness of its 
language and its capability to interact with external DBMSs. 

For each system, we used the latest release available at the time tests have been 
carried out. 

6.1.1 Database systems 

As far as database systems are concerned, it is worth pointing out that none of 
the considered ones fully adopt the SQL99 standard for the definition of recursive 
queries, but proprietary constructs are introduced by each of them. 

In particular, both DB-A and DB-B support the standard recursive functional- 
ities that are needed for our benchmarks, even if proprietary constructs must be 
added to the standard SQL99 statement to guarantee the termination of some kinds 
of queries. On the contrary, DB-C implements a large subset of SQL99 features and 
supports recursion but, as far as recursive queries are concerned, it exploits pro- 
prietary constructs which do not follow the standard SQL99 notation, and whose 
expressiveness is lower than that of SQL99; as an example, it is not possible to 
express unbound queries within recursive statements (e.g., all the pairs of nodes 
linked by at least one path in a graph). 

6.1.2 LDL + + 

The LDL++ system (jArni et al. 2003j) integrates rule-based programming with ef- 
ficient secondary memory access, transaction management recovery and integrity 
control. The underlying database engine has been developed specifically within the 
LDL project and is designed as a virtual-memory record manager, which is opti- 
mized for the situation where the pages containing frequently used data can reside 
in main-memory. LDL++ can also be interfaced with external DBMSs, but it is 
necessary to implement vendor-specific drivers to handle data conversion and lo- 
cal SQL dialects (jArni et al. 2003|) . The LDL++ language supports complex terms 
within facts and rules, stratified negation, and don't care non-determinism based 
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on stable model semantics. Moreover, LDL++ supports updates through special 
rules. 

In our tests we used version 5.3 of LDL++. Test data have been fed to the system 
by text files storing input facts. 



6.1.3 XSB 

The XSB system (|Rao et al. 1997[) is an inmemory deductive database engine based 
on a Prolog/SLD resolution strategy called SLG. It supports explicitly locally strat- 
ified programs. The inference engine, which is called SLG-WAM, consists of an 
efficient tabling engine for definite logic programs, which is extended by mecha- 
nisms for handling cycles through negation. These mechanisms are negative loop 
detection, delay and simplification. They serve for detecting, breaking and resolving 
cycles through negation. 

XSB allows the exploitation of data residing in external databases, but reasoning 
on such data is carried out in main-memory. The version of XSB we used in our 
tests is 2.2. 



6.1.4 SModels 

The SModels system (Nicmela et al. 2000; Niemcla and Si mons 1997|) implements 
the answer set semantics for normal logic programs extended by built-in functions 
as well as cardinality and weight constraints for domain-restricted programs. 

The SModels system takes as input logic program rules in Prolog style syntax. 
However, in order to support efficient implementation techniques and extensions, 
the programs are required to be domain-restricted where the idea is the following: 
the predicate symbols in the program are divided into two classes, domain predicates 
and non-domain predicates. Domain predicates are predicates that are defined non- 
recursively. The main intuition of domain predicates is that they are used to define 
the set of terms over which the variable range in each rule of a program P. All 
rules of P have to be domain-restricted in the sense that every variable in a rule 
must appear in a domain predicate which appears positively in the rule body. In 
addition to normal logic program rules, SMODELS supports rules with cardinality 
and weight constraints, which are similar to ^count and #sum aggregates of DLV. 

SModels does not allow to handle data residing in database relations; moreover, 
all the stages of the computation are carried out in main-memory. Finally, it does 
not support optimization strategies for bound queries; consequently, the time it 
needs for executing the same query either with all parameters unbound or with 
some parameters bound is exactly the same. 

In our tests we used SModels ver. 2.28 with Lparse ver. 1.0.17. Test data have 
been fed to the system by text files storing input facts. 
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6.1.5 DLV DB 

It is the direct database execution of our system; in our tests we used a commercial 
database as DBMS for the working database. However, to guarantee fairness with 
the other systems, we did not set any additional index or key information for the 
involved relations. We point out again that any DBMS supporting ODBC could be 
easily coupled with DLV DB . 

6.1.6 DLV 10 

It is the main-memory execution modality of the system presented in this paper. 
Recall that it basically corresponds to the execution of the standard DLV system 
with data loaded from databases. 



6.2 Benchmark Problems 

To asses the performance of the systems described above, we carried out several 
tests using classical benchmark problems from the context of deductive databases 
(Bancilhon and Ramakrishnan 1988; Greco 2003), namely Reachability and Same 
Generation. The former allows the analysis of basic recursion capabilities of the 
various systems on several data structures, whereas the latter implements a more 
complex problem and, consequently, allows the capability of the considered systems 
to carry out more refined reasoning tasks to be tested. 

For each problem, we measured the performance of the various systems in com- 
puting three kinds of queries, namely: unbound queries (identified by the sym- 
bol Qo in the following); queries with one bound parameter (Qi); queries with 
all bound parameters {0,2)- Considering these three cases is important because 
DBMSs and Deductive Databases generally benefit from query bindings (by "push- 
ing down" selections through relational algebra optimizations, magic set techniques, 
or, for XSB, top down evaluation), whereas ASP systems are generally more effec- 
tive with unbound queries (since they usually compute the entire models anyway); 
as a consequence, it is interesting to test all these systems in both their favor- 
able and unfavorable contexts. It is worth pointing out that some of the tested 
systems implement optimization strategies 'a la magic set' (jBancilhon et al. 19861 
IBeeri and Ramakrisnhan 19911 iMumick et al. 19961 IRoss 1990|) (e.g., DLV DB and 
LDL+- 1-), typical of deductive databases, or other program rewriting techniques; as 
a consequence, the actually evaluated programs are the optimized ones automat- 
ically derived by these systems, but the cost of these rewritings has been always 
considered in the measure of systems' performance. 

In what follows we briefly introduce the two considered problems; the interested 
reader can find all details about them in (Bancilhon and Ramakrishnan 1988). 



6.2.1 Reachability 



Given a directed graph G = (V, E) the solution to the reachability problem reach- 
able^, b) determines whether a node b e V is reachable from a node a e V through 
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a sequence of edges in E. The input is provided by a relation edge{X, Y) where a 
fact edge(a, b) states that b is directly reachable by an edge from a. 

In database terms, determining all pairs of reachable nodes in G amounts to 
computing the transitive closure of the relation storing the edges. 



6.2.2 Same Generation 

Given a parent-child relationship (a tree), the Same Generation problem aims to 
find pairs of persons belonging to the same generation. Two persons belong to the 
same generation either if they are siblings, or if they are children of two persons of 
the same generation. 

The input is provided by a relation parent(X,Y) where a fact parent(thomas, 
moritz) means that thomas is the parent of moritz. 



6.3 Benchmark Data Sets 

For each considered problem we exploited several sets of benchmark data struc- 
tures. For each data structure various instances of increasing dimensions have been 
constructed; the size of each instance is measured in terms of the number of input 
facts describing it. 



6.3.1 Reachability 

As for the Reachability problem, we considered: (i) full binary trees, (ii) acyclic 
graphs (a-graphs in the following), (Hi) cyclic graphs (c-graphs in the following), 
and (iv) cylinders (Bancilhon and Ramakrishnan 1988). 

The density S of a graph can be measured as S = ^ tt a j rCS ^ n -^! 1C S ra P n ^y e 
y b * # of possible arcs 

generated various typologies of graph instances, characterized by values of 5 equal 
to 0.20, 0.50 and 0.75 respectively. Due to space constraints, in this paper we report 
just the results obtained for 5 = 0.20. 

Cylinders are particular kinds of acyclic graphs which can be layered; each layer 
has the same number of nodes. Each node of the first layer has two outgoing arcs 
and no incoming arcs, whereas each node of the last layer has two incoming arcs 
and no outgoing arcs; finally, each node of an internal layer has two incoming and 
two outgoing arcs. An example of a cylinder is shown in Figure [8] A cylinder has 
then a width and a height; as a consequence, the ratio p — ^gjjj: can De exploited 
to characterize a cylinder. We generated various categories of cylinders having p 
equal to 0.5, 1.0 and 1.5 respectively. Due to space constraints, in this paper we 
report just the results obtained for p = 1. 

Graphs have been generated using the Stanford GraphBase (jKnuth 1994ft library 
whereas trees and cylinders have been generated using ad-hoc procedures, since they 
are characterized by a regular structure. 
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Fig. 8. Example of a cylinder graph. 

6.3.2 Same Generation 

As far as the Same Generation problem is concerned, we exploited full binary trees 
as input data structures. 



6.4 Problem Encodings 

We have used general encodings for the two considered problems in a way which 
tests the various systems under generic conditions; specifically, we used "uniform" 
queries, i.e. queries whose structure must not be modified depending on the quantity 
and positions of bound parameters. Several alternative encodings could have been 
possible for the various problems, depending also on the underlying data structures; 
however, since many other problems of practical relevance can be brought back to 
the ones we considered, we preferred to exploit those encodings applicable to the 
widest variety of applications. 

Due to space constraints we can not list here the encodings exploited in our tests. 



The interested reader can find them at the address http://www.mat.unical.it/ 
terracina/tplp-dlvdb/encodings.pdf. 

Note that, since DB-C does not support the standard SQL99 language, but only 
a simplified form of recursion, we have not tested this system along with the other 
ones. We will discuss encodings and results obtained for DB-C in a separate section. 



6.5 Results and Discussion 

In our tests we measured the time required by each system to answer the various 
queries. We fixed a maximum running time of 12000 seconds (about 3 hours) for 
each test. In the following figures, the line of a system stops whenever some query 
was not solved within this time limit (note that graphs have a logarithmic scale on 
the vertical axis). 

In more detail, Figures I51TTT1 show results obtained for the various tests; the head- 
line of each graph reports the corresponding query. 

From the analysis of these figures we can observe that, in several cases, the 
performance of DIN DB (the black triangle in the graphs) is better than all the 
other systems with orders of magnitude and that DLV DB allows almost always the 
handling of the greatest amount of data; moreover, there is no system which can 
be considered the "competitor" of F>LV DB in all the tests. 

In particular, in some tests, XSB shows a good behaviour (e.g., in Reachability on 
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Fig. 9. Results for Same Generation on trees and Reachability with acyclic graphs 



cyclic graphs and cylinders) but, even in those positive tests, it "dies" earlier than 
DLV DB (with the exception of reachable(bl,Y) on cylinders), probably because it 
exceeds the main-memory. 

LDL++ is competitive with DLV DB only in reachable(bl,Y) on cyclic graphs 
and cylinders, whereas in all the other queries the performance difference is of more 
than one order of magnitude. 

DB-B performance is near to that of DIN DB only in samegen(X,Y); in all the 
other cases its line is near to the vertical axis. 

DB-A showed very good performance only for reachability on trees (see also 
Table [T] introduced next). This behaviour could be justified by the presence of 
optimization mechanisms implemented in this system which are particularly suited 
for computing the transitive closure on simple data structures (like trees), but these 
are not effective for other (more complex) kinds of query/data structure. 
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Surprisingly enough, DBMSs often have the worst performance (their times are 
near to the vertical axis) and they can handle very limited amounts of input data. 

Finally, as expected, DLV /0 is capable of handling lower amounts of data w.r.t. 
DLV DB ; however, in several cases it was one of the best three performing systems, 
especially on bound queries. This result is mainly due to the magic sets optimization 
technique it implements. 

A rather surprising result is that DLV 7 ° has almost always higher execution 
times than DLV DB even for not very high input data sizes. The motivation for 
this result can be justified by the following reasoning. Both D1N DB and DLV 7 ° 
benefit from all the program rewriting optimization techniques developed in the 
DLV project; moreover, both of them implement a differential Semi-Naive approach 
for the evaluation of normal stratified programs. However, while DLV /0 reasons 
about its underlying data in a tuple-at-a-time way, BUV DB exploits a set-at-a-time 
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strategy (implemented by SQL queries); this, in conjunction with the fact that 
the underlying working database implements advanced data-oriented optimization 
strategies, makes DLV DB more efficient than DLV 7 ° even when all the data fits in 
main-memory. 

As pointed out also in (Bancilhon and Ramakrishnan 1988), another important 
parameter to measure in this context is the system's capability of handling large 
amounts of data. In order to carry out this verification, we considered the time 
response of each system for the largest input data set we have used in each query. 

Table [1] shows the execution times measured for those systems which have been 
capable of solving the query within the fixed time limit of 12000 seconds; the second 
column of the table shows, for each query, both the input data size, measured in 
terms of the number of input facts (tuples), and the total amount of handled data, 
measured in Mbytes,given by the size of the answer set produced by DUV DB in 
answering that query 10 !. 

From the analysis of this table, we may observe that: (i) DLV DB has been always 
capable of solving the query on the maximum data size; (ii) in 11 queries out of 
15 DLV DB (in one case along with DLV /0 ) has been the only system capable of 
completing the computation within the time limit; (Hi) DLV DB allowed to handle 
up to 6.7 Gbytes of data in samegen(X,Y) and 1.6 Gbytes in reachable(X,Y) on 



Note that all facts produced by DLV DS to answer the query are considered 
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Table 1 . Execution times of the systems capable of solving the query for the max- 
imum considered size of the input data 



Query / 


Input Size (tuples) / 1 


DB-B 


DLV'O 


DLV DS 


LDL+ + 


Smodcls 


DB-A 


XSB 


Data Type 


Output size (Mbytes) | 


(sec) 


(sec) 


(sec) 


(sec) 


(sec) 


(sec) 


(sec) 



namcgcn(X, Y) 


32766 


5552 








tree 


6716 Mb 










samcgcn(bl,Y) 


4194302 


- - 64 








tree 


78 Mb 










samcgcn(bl,b2) 


4194302 


102 








tree 


78 Mb 










reachablc(X,Y) 


929945 


11820 








a- graph 


103 Mb 










reachable (bl,Y) 


929945 


1191 








a- graph 


38 Mb 










reachable (bl,b2) 


929945 


- - 4 








a- graph 


17 Mb 










reachablc(X,Y) 


612150 


11936 








c- graph 


68 Mb 










reachable (bl,Y) 


612150 


11933 








c- graph 


68 Mb 










reachable (bl,b2) 


612150 


- 981 8 








c- graph 


11 Mb 










reachablc(X,Y) 


23980 


11784 








cylinder 


465 Mb 










reachable (bl,Y) 


145260 


11654 


2284 




157 


cylinder 


279 Mb 










reachable (bl,b2) 


582120 


388 








cylinder 


13 Mb 










reachablc(X,Y) 


4194302 


11161 




7280 




tree 


1634Mb 










reachable (bl,Y) 


4194302 


- - 76 




6438 






79 Mb 










rcachablc(bl,b2) 
tree 


4194302 


60 




12 





trees within the fixed time limit of 12000 seconds and never ended its computation 
due to lack of memory, as instead other systems did. 

6.6 Comparison to DB-C 

As previously pointed out, DB-C does not support the standard SQL99 encoding 
for recursive queries, but it exploits a proprietary language for implementing a 
simplified form of recursion. This language is less expressive than SQL99 for recur- 
sion; as an example, unbound recursive queries cannot be implemented in DB-C; 
analogously, it does not allow to write recursive views in a "uniform" way (i.e., 
independently from the specific bound parameters). 

As for the problems addressed in this paper, it was not possible to write the 
unbound query either for Reachability, or for Same Generation with DB-C. The 
other queries have encodings not equivalent to the general version we adopted for 
the other systems. 

As an example, the query Qi = reachable(bl ,Y) can be expressed in DB-C by 
the following statement: 

SELECT 61, edge.att 2 FROM edge 

START WITH atti= 61 CONNECT BY PRIOR att 2 = ath 
which, however, is equivalent to the datalog program: 
reached (61). 

reached(X) :- reached (Y), edge(Y,X). 
reachable(bl , Y) :- reached(Y). 
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This is clearly a program that can be evaluated more easily than the general 
encoding, because it involves a recursive rule with one single attribute and a unique 
starting point for the recursion (the fact reached(bl)); however, this query (and the 
equivalent program) is less general than the one introduced in Section [63 since its 
structure must be modified if, for example, we need to have both the parameters 
bound or if we want to bound the second parameter instead of the first. 

Clearly, testing such encodings against the other, more general, ones would have 
been unfair. Anyway, we carried out some tests involving DB-C, by applying its en- 
codings and the corresponding datalog programs on the maximum data instances we 
considered for the various queries, in order to have a rough idea on the performance. 
As an example, for the query Q\ = reachable(bl,Y) mentioned above, on a-graphs 
(resp., c-graphs) of size 929945 (resp., 612150) tuples we have measured that DB- 
C takes 22.5 (resp., 15.9) seconds, whereas DLV DB takes 6.4 (resp., 5.6) seconds. 
Analogously, for the query Qi — samegen(bl,Y), on trees of size 4194302 tuples, 
DB-C requires 1329.4 seconds to terminate the computation, whereas DIN DB re- 
quires 500.8 seconds. DB-C performed better than DLV DB only for Reachability 
on trees; also in this case, as we have done for DB-A, we may conjecture that this 
behaviour is motivated by the particular optimization techniques implemented in 
the system. 

These results are representative of the overall performance we have measured 
for DB-C in our benchmarks; on the one hand they confirm our claim that the 
encodings solvable by DB-C are very different, also from a performance point of 
view, w.r.t. the general ones used in our benchmarks (as an example, this is proved 
by the significantly lower timing measured for DLV DB in reachable(bl,Y) w.r.t. the 
same query in the standard encoding) ; on the other hand, they allow us to conclude 
that the same reasoning as that drawn in Section |6~51 about DLV DB performance is 
still valid. 

7 Conclusions 

In this paper we have presented T)LV DB , a new deductive system for reasoning on 
massive amounts of data. It presents features of an efficient DDS but also extends 
the capability of handling data residing in external databases to a disjunctive logic 
programming system. A thorough experimental validation showed that DIN DB pro- 
vides both important speed ups in the running time of typical deductive queries 
and the capability to handle larger amounts of data w.r.t. existing systems. Inter- 
estingly, the experimental results show that DIN DB significantly outperforms both 
commercial DBMSs and other logic-based systems in the evaluation of recursive 
queries. 

The key reason for the relevant performance improvement obtained by our system 
is the integration of the following factors: (i) The idea to employ the efficient engine 
of a commercial DBMS for rule evaluation, by translating logical rules in SQL 
statements (which are then executed by a DBMS), allowing us to exploit the efficient 
data-oriented optimization techniques of relational databases, (ii) The exploitation 
of advanced optimization techniques developed in the field of deductive databases 
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for logical query optimization (like, e.g., magic sets). (Hi) A proper combination 
and a well-engineered implementation of the above ideas. Moreover, the usage of 
a purely mass-memory evaluation strategy, improves previous deductive systems 
eliminating, in practice, any limitation in the dimension of the input data. 

In the future we plan to extend the language supported by the direct database 
execution and to exploit the system in interesting research fields, such as data 
integration and data warehousing. Moreover, a mixed approach exploiting both 
DUV DB and DLV 70 executions to evaluate hard problems partially on mass-memory 
and partially in main-memory will be explored. 
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